ggplot(summaries_1, aes(x = health, y = mean, col = factor(exerany))) +
  geom_point(size = 2) +
  geom_line(aes(group = factor(exerany))) +
  scale_color_viridis_d(option = "C", end = 0.5) +
  labs(title = "Observed Means of BMI",
       subtitle = "by Exercise and Overall Health")
Note the use of factor() here, since the exerany variable is in fact numeric, although it takes only the values 1 and 0.
Sometimes it’s helpful to treat 1/0 as a factor, and sometimes not.
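The distinction matters to ggplot2: a numeric 0/1 column gets a continuous color scale, while a factor gets a discrete one. A minimal base-R check of what factor() does to a 0/1 variable (values here are made up):

```r
# exerany as collected: numeric, values 0 and 1
exerany <- c(1, 0, 1, 1, 0)
class(exerany)          # "numeric"

# wrapped in factor(), as in the ggplot call above: two discrete levels
class(factor(exerany))  # "factor"
levels(factor(exerany)) # "0" "1"
```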
Where is the evidence of serious non-parallelism (if any) in the plot on the next slide that results from this code?
This looks very different from our previous version of the model.
What happens when we make a prediction, though?
Prediction in the Orthogonal Polynomial Model
Remember that in our raw polynomial model, our “by hand” and “using R” calculations both concluded that the predicted bmi for a subject with fruit_day = 2 was 27.466.
Now, what happens with the orthogonal polynomial model temp2 we just fit?
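In general, raw and orthogonal polynomials are just different parameterizations of the same fitted model, so their predictions agree. A self-contained sketch (the real temp1/temp2 and training data aren't reproduced here; fruit_day and bmi below are simulated stand-ins):

```r
# Simulated stand-in for the training data
set.seed(1)
dat <- data.frame(fruit_day = runif(100, 0, 5))
dat$bmi <- 28 - 0.5 * dat$fruit_day + 0.1 * dat$fruit_day^2 + rnorm(100)

# Raw vs. orthogonal polynomial fits of the same degree
raw_fit  <- lm(bmi ~ fruit_day + I(fruit_day^2), data = dat)
orth_fit <- lm(bmi ~ poly(fruit_day, 2), data = dat)

# The coefficients differ, but predictions are identical
new_d <- data.frame(fruit_day = 2)
all.equal(as.numeric(predict(raw_fit, new_d)),
          as.numeric(predict(orth_fit, new_d)))  # TRUE
```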
The main reason is to avoid having to include powers of our predictor that are highly collinear.
Variance Inflation Factor assesses collinearity…
vif(temp1) ## vif() from the rms package

     fruit_day I(fruit_day^2) 
      4.652178       4.652178 
Orthogonal polynomial terms are uncorrelated with one another, easing the process of identifying which terms add value to our model.
vif(temp2)

poly(fruit_day, 2)1 poly(fruit_day, 2)2 
                  1                   1 
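Where do these VIF values come from? The VIF for a predictor is 1/(1 − R²), where R² comes from regressing that predictor on the other predictors. A sketch with simulated data (so the numbers will differ from the slide's output):

```r
set.seed(3)
x <- runif(200, 0, 5)

# Raw polynomial: x and x^2 are highly correlated, so the VIF is large
r2_raw  <- summary(lm(x ~ I(x^2)))$r.squared
vif_raw <- 1 / (1 - r2_raw)

# Orthogonal polynomial: the two columns are uncorrelated, so the VIF is 1
P <- poly(x, 2)
r2_orth  <- summary(lm(P[, 1] ~ P[, 2]))$r.squared
vif_orth <- 1 / (1 - r2_orth)

c(vif_raw = vif_raw, vif_orth = vif_orth)  # vif_orth is essentially 1
```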
Why orthogonal rather than raw polynomials?
The tradeoff is that the raw polynomial is a lot easier to explain in terms of a single equation in the simplest case.
In practice, we'll usually avoid polynomials and instead use splines, which are more flexible and require less maintenance, at the cost of essentially requiring you to visualize their predictions rather than interpret their equations.
Did the polynomial term in m_3 and m_3int improve our predictions?
Splines
A linear spline is a continuous function formed by connecting points (called knots of the spline) by line segments.
A restricted cubic spline is a way to build highly complicated curves into a regression equation in a fairly easily structured way.
A restricted cubic spline is a series of cubic polynomials joined together at the knots, constrained ("restricted") to be linear in the tails, before the first knot and after the last.
Such a spline gives us a way to flexibly account for non-linearity without over-fitting the model.
Restricted cubic splines can fit many different types of non-linearities.
Specifying the number of knots is all you need to do in R to get a reasonable result from a restricted cubic spline.
The most common choices are 3, 4, or 5 knots.
3 Knots, 2 degrees of freedom, lets the curve “bend” once.
4 Knots, 3 degrees of freedom, lets the curve “bend” twice.
5 Knots, 4 degrees of freedom, lets the curve “bend” three times.
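The knots-to-df pattern above can be checked directly. rcs() comes from the rms package, but the natural cubic splines in splines::ns() (which ships with base R) are the same family, just a different basis: a basis with k total knots has k − 1 columns.

```r
library(splines)  # ships with base R

x <- seq(10, 50, length.out = 250)
b <- ns(x, df = 4)  # request 4 degrees of freedom

ncol(b)                            # 4 basis columns
length(attr(b, "knots"))           # 3 interior knots...
length(attr(b, "Boundary.knots"))  # ...plus 2 boundary knots = 5 total
```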
A simulated data set
set.seed(4322021)
sim_data <- tibble(
  x = runif(250, min = 10, max = 50),
  y = 3 * (x - 30) - 0.3 * (x - 30)^2 + 0.05 * (x - 30)^3 +
      rnorm(250, mean = 500, sd = 70))

head(sim_data, 2)
# A tibble: 2 × 2
x y
<dbl> <dbl>
1 42.5 397.
2 35.9 414.
The sim_data, plotted.
Fitting Restricted Cubic Splines with lm and rcs
sim_linear <- lm(y ~ x, data = sim_data)
sim_poly2  <- lm(y ~ poly(x, 2), data = sim_data)
sim_poly3  <- lm(y ~ poly(x, 3), data = sim_data)
sim_rcs3   <- lm(y ~ rcs(x, 3), data = sim_data)
sim_rcs4   <- lm(y ~ rcs(x, 4), data = sim_data)
sim_rcs5   <- lm(y ~ rcs(x, 5), data = sim_data)
Looking at the Polynomial Fits
Looking at the Restricted Cubic Spline Fits
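Besides eyeballing the plots, we can compare candidate fits numerically, for instance with AIC. A self-contained sketch that rebuilds sim_data and substitutes splines::ns() for rms::rcs() (same spline family, different basis) so it runs without the rms package:

```r
library(splines)

# Rebuild the simulated data from the earlier slide, in base R
set.seed(4322021)
sim_data <- data.frame(x = runif(250, min = 10, max = 50))
sim_data$y <- 3 * (sim_data$x - 30) - 0.3 * (sim_data$x - 30)^2 +
  0.05 * (sim_data$x - 30)^3 + rnorm(250, mean = 500, sd = 70)

sim_linear <- lm(y ~ x, data = sim_data)
sim_poly3  <- lm(y ~ poly(x, 3), data = sim_data)
sim_ns4    <- lm(y ~ ns(x, df = 3), data = sim_data)  # roughly rcs(x, 4)

AIC(sim_linear, sim_poly3, sim_ns4)  # lower is better; y is cubic in x,
                                     # so the linear fit should lose badly
```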
Fitting Restricted Cubic Splines with lm and rcs
For most applications, three to five knots strike a nice balance between complicating the model needlessly and fitting data pleasingly. Let’s consider a restricted cubic spline model for bmi based on fruit_day again, but now with:
in temp3, 3 knots, and
in temp4, 4 knots.
temp3 <- lm(bmi ~ rcs(fruit_day, 3), data = train_c4im)
temp4 <- lm(bmi ~ rcs(fruit_day, 4), data = train_c4im)
female is based on biological sex (1 = female, 0 = male)
exerany comes from a response to “During the past month, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?” (1 = yes, 0 = no, don’t know and refused = missing)
The variable is based on “Would you say that in general your health is …” using the five specified categories (Excellent -> Poor), numbered for convenience after data collection.
Don’t know / not sure / refused were each treated as missing.
Might want to run a sanity check here, just to be sure…
Checking health vs. genhealth in class4
class4 |> tabyl(genhealth, health) |> adorn_title()
health
genhealth E VG G F P NA_
1_Excellent 148 0 0 0 0 0
2_VeryGood 0 324 0 0 0 0
3_Good 0 0 274 0 0 0
4_Fair 0 0 0 112 0 0
5_Poor 0 0 0 0 35 0
<NA> 0 0 0 0 0 1
OK. We’ve preserved the order and we have much shorter labels. Sometimes, that’s helpful.
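For the record, one plausible way to build those short labels (the slides created health earlier, possibly by different code; this base-R sketch just shows the idea):

```r
# A stand-in genhealth factor with the five ordered levels from the slide
genhealth <- factor(
  c("1_Excellent", "3_Good", "5_Poor", "2_VeryGood"),
  levels = c("1_Excellent", "2_VeryGood", "3_Good", "4_Fair", "5_Poor"))

# Relabel in place: same underlying order, shorter labels
health <- factor(genhealth, labels = c("E", "VG", "G", "F", "P"))
levels(health)  # "E" "VG" "G" "F" "P"
```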
Multicategorical race_eth in raw class4
class4 |> count(race_eth)
# A tibble: 6 × 2
race_eth n
<fct> <int>
1 Black non-Hispanic 167
2 Hispanic 27
3 Multiracial non-Hispanic 19
4 Other race non-Hispanic 22
5 White non-Hispanic 646
6 <NA> 13
“Don’t know”, “Not sure”, and “Refused” were treated as missing.
What is this variable actually about?
What is the most common thing people do here?
What is the question you are asking?
Collapsing race_eth levels might be rational for some questions.
We have lots of data from two categories, but only two.
Systemic racism affects people of color in different ways across these categories, but also within them.
Is combining race and Hispanic/Latinx ethnicity helpful?
It’s hard to see the justice in collecting this information and not using it in as granular a form as possible, though this leaves some small sample sizes. There is no magic number for “too small a sample size.”
Most people identified themselves in one of the categories.
These data are not ordered, and (I’d argue) ordering them isn’t helpful.
Regression models are easier to interpret, though, if the “baseline” category is a common one.
Resorting the factor for race_eth
Let’s sort all five levels, from most observations to least…
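In the tidyverse this is forcats::fct_infreq(); here is a base-R sketch of the same reordering, using a made-up factor so it is self-contained:

```r
# Made-up factor standing in for race_eth
f <- factor(c("White", "Black", "White", "Hispanic", "White", "Black"))

# Reorder levels from most to least frequent; fct_infreq(f) does the same
f2 <- factor(f, levels = names(sort(table(f), decreasing = TRUE)))
levels(f2)  # "White" "Black" "Hispanic"
```

With the most common category first, it becomes the baseline in model matrices built from this factor.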